After building production multi-agent pipelines with Claude as the primary reasoning engine, I've found a set of patterns that rarely appear in the official docs. The current lineup is Opus 4.8 (launched May 28, 2026), Sonnet 4.6, and Haiku 4.5 — all with 1M token context windows. Here is what actually matters for keeping costs down and pipelines stable.
1. The Claude Model Lineup in June 2026
Opus 4.8 is priced at $5 per million input tokens and $25 per million output tokens, a 67% reduction from the Opus 4/4.1 era ($15/$75). The Batch API takes an additional 50% off across all models for async work. Cache reads are billed at roughly 10% of the standard input rate. These discounts can stack, so a cached input token sent through the Batch API costs a small fraction of the list price.
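To make the stacking concrete, here is a minimal sketch of the arithmetic. It assumes the discounts quoted above (cache reads at about 10% of the input rate, Batch API at 50% off) simply multiply together; the function names are illustrative and not part of any SDK.

```python
def price_reduction(old_rate: float, new_rate: float) -> float:
    """Fractional price drop when moving from old_rate to new_rate."""
    return 1 - new_rate / old_rate


def effective_input_rate(base_rate: float, cached: bool = False, batch: bool = False) -> float:
    """Dollars per million input tokens after the discounts that apply."""
    rate = base_rate
    if cached:
        rate *= 0.10  # assumption: cache reads at ~10% of the input rate
    if batch:
        rate *= 0.50  # assumption: the Batch API discount applies on top
    return rate


print(f"Opus price drop: {price_reduction(15.0, 5.0):.0%}")
print(f"Cached + batched input rate: ${effective_input_rate(5.0, cached=True, batch=True):.2f} per million")
```

Under these assumptions, a cached and batched input token on Opus 4.8 comes to roughly $0.25 per million, against a $5 list price.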
2. Prompt Caching: The Biggest Cost Lever
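Prompt caching lets Claude reuse the beginning of a request if that prefix is identical to a previously cached version. Cache reads cost roughly 10% of standard input tokens. On a pipeline that passes a 50K-token shared codebase context to 6 agents, that is 300K input tokens per run: about $1.50 at Opus 4.8's $5 per million input rate, versus about $0.15 when every agent reads the context from cache. The first request of a cold run also pays a one-time cache-write premium.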
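A worked version of that scenario, as a sketch: it assumes the $5 per million input rate quoted earlier, cache reads at 10% of that rate, and a 1.25x write premium on the request that populates a 5-minute cache. It covers the shared context only, not each agent's own prompt or output, and the helper name is made up for this example.

```python
INPUT_RATE = 5.00 / 1_000_000  # dollars per input token (Opus 4.8 rate quoted above)
CACHE_READ_MULT = 0.10         # cache reads: ~10% of the input rate
CACHE_WRITE_MULT = 1.25        # assumption: 5-minute cache write premium


def shared_context_cost(context_tokens: int, agents: int, cached: bool, warm: bool = False) -> float:
    """Input cost of sending the same context to every agent in one run."""
    if not cached:
        return context_tokens * agents * INPUT_RATE
    # On a cold run the first agent writes the cache and the rest read it;
    # on a warm run (cache still alive from a previous run) everyone reads.
    reads = agents if warm else agents - 1
    writes = 0 if warm else 1
    return context_tokens * INPUT_RATE * (reads * CACHE_READ_MULT + writes * CACHE_WRITE_MULT)


uncached = shared_context_cost(50_000, 6, cached=False)
cold = shared_context_cost(50_000, 6, cached=True)
warm = shared_context_cost(50_000, 6, cached=True, warm=True)
print(f"uncached ${uncached:.2f}, cold cache ${cold:.2f}, warm cache ${warm:.2f}")
```

With these numbers the shared context costs about $1.50 uncached, about $0.44 on a cold run, and about $0.15 once the cache is warm.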
As of February 5, 2026, caching uses workspace-level isolation. Caches are scoped per workspace, not per organization — relevant if you share an org with multiple teams.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-8-20260528",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 50K tokens of stable context
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": dynamic_user_message}
    ],
)

# Check cache status: a write reports creation tokens, a hit reports read tokens
cache_creation = response.usage.cache_creation_input_tokens
cache_read = response.usage.cache_read_input_tokens
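Those two counters are easier to monitor as a single ratio. The helper below is a small sketch, not part of the SDK: it assumes the usage object exposes `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, and treats their sum as the total input for the request. A `SimpleNamespace` stands in for a real response so the example runs offline.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Share of this request's input tokens that were served from cache."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = getattr(usage, "input_tokens", 0) or 0
    total = read + created + fresh
    return read / total if total else 0.0


# Stand-ins for response.usage on a cold call and on a later warm call
cold_call = SimpleNamespace(input_tokens=200, cache_creation_input_tokens=49_800, cache_read_input_tokens=0)
warm_call = SimpleNamespace(input_tokens=200, cache_creation_input_tokens=0, cache_read_input_tokens=49_800)

print(f"cold: {cache_hit_ratio(cold_call):.1%}, warm: {cache_hit_ratio(warm_call):.1%}")
```

Logging this per agent call makes a silent cache miss show up as a ratio that drops to zero.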
3. The Two Cache TTLs (and Why 1 Hour Matters)
Anthropic now offers two TTL options: 5 minutes (default) and 1 hour. A 1-hour cache write is billed at 2x the base input rate, versus 1.25x for the 5-minute default, but it is usually the right choice for agent workloads. Extended thinking tasks can take longer than 5 minutes to complete, which means a 5-minute cache can expire before the next agent turn even starts. For long-running pipelines, request the 1-hour TTL.
system=[
    {
        "type": "text",
        "text": STABLE_CONTEXT,
        "cache_control": {
            "type": "ephemeral",
            "ttl": "1h",  # 1 hour instead of the default "5m"
        },
    }
]
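One way to avoid hard-coding the choice is to pick the TTL from the gap you expect between calls that share the prefix. This is only a sketch: the thresholds mirror the two TTLs described above, the TTL is passed as the string "1h", and the function name is hypothetical.

```python
from typing import Optional

FIVE_MINUTES = 5 * 60
ONE_HOUR = 60 * 60


def cache_control_for(expected_gap_seconds: float) -> Optional[dict]:
    """Pick a cache_control value from the expected gap between calls sharing a prefix."""
    if expected_gap_seconds < FIVE_MINUTES:
        return {"type": "ephemeral"}               # default 5-minute TTL is enough
    if expected_gap_seconds < ONE_HOUR:
        return {"type": "ephemeral", "ttl": "1h"}  # pay the higher write price once
    return None                                    # cache would expire first; skip the write premium


print(cache_control_for(90))       # quick back-to-back agent turns
print(cache_control_for(12 * 60))  # long extended-thinking step in between
print(cache_control_for(3 * 3600)) # nightly job: caching does not help
```

Returning `None` for very long gaps is a design choice in this sketch: writing a cache that will expire before the next read only adds the write premium.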
4. Cache Invalidation: The Traps That Kill Your Hits
Cache hits require an identical prefix. Anything that changes the prefix between requests invalidates the cache silently. The most common trap: putting a timestamp in your system prompt.
Other things that invalidate the prefix even when your actual task prompt is identical: changing tool definitions, toggling extended thinking on or off between requests, adding or removing images, and changing tool_choice settings. Pick settings up front and keep them stable per conversation.
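The timestamp trap can be sidestepped by moving volatile values out of the cached prefix and into the user turn. The sketch below shows that layout, plus a fingerprint you could log to detect accidental prefix drift. The fingerprint is a local debugging aid built on an assumption (that byte-identical tools and system blocks are what you want to keep stable), not something the API provides, and both function names are invented for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def prefix_fingerprint(tools: list, system: list) -> str:
    """Hash of the parts of the request that are meant to stay stable."""
    # Keys are serialized in insertion order on purpose, so any reordering shows up too.
    canonical = json.dumps({"tools": tools, "system": system})
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_request(stable_context: str, task: str, now: datetime) -> dict:
    """Keep the system prompt static; put the timestamp in the user message instead."""
    return {
        "system": [
            {"type": "text", "text": stable_context, "cache_control": {"type": "ephemeral"}}
        ],
        "messages": [
            {"role": "user", "content": f"Current time: {now.isoformat()}\n\n{task}"}
        ],
    }


first = build_request("You audit JavaScript code.", "Review src/main.js", datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc))
second = build_request("You audit JavaScript code.", "Review src/main.js", datetime(2026, 6, 1, 12, 5, tzinfo=timezone.utc))

# Same fingerprint on both calls means the cached prefix did not change
print(prefix_fingerprint([], first["system"]) == prefix_fingerprint([], second["system"]))
```

If the fingerprint changes between two calls that should share a cache, something in the tools or system blocks moved.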
5. Tool Caching: Where to Put cache_control
When you pass a tools array, put cache_control on the last tool in the array. Claude caches the prefix up to and including that marker. If you have more than 15-20 tools, consider deferred tool loading from the start — both for caching efficiency and for model performance, since the model reasons over all tool definitions on every turn.
tools = [
    {
        "name": "search_web",
        "description": "...",
        "input_schema": {...}
    },
    {
        "name": "read_file",
        "description": "...",
        "input_schema": {...},
        "cache_control": {"type": "ephemeral"}  # on the LAST tool
    }
]
# cache_control on the last tool caches the entire tools prefix
# stable tools array = cache hits on every call
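If tool lists are assembled dynamically, a small helper can keep the order stable and move the marker to whichever tool ends up last. This is a sketch under the assumption that sorting by name is an acceptable way to fix the order; `with_cache_marker` is a made-up name, and the empty schemas are placeholders.

```python
import copy


def with_cache_marker(tools: list[dict]) -> list[dict]:
    """Return a copy of tools in a stable order with cache_control on the last one only."""
    marked = copy.deepcopy(tools)           # never mutate the caller's definitions
    for tool in marked:
        tool.pop("cache_control", None)     # drop stale markers from earlier assembly
    marked.sort(key=lambda tool: tool["name"])
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


tools = [
    {"name": "search_web", "description": "Search the web", "input_schema": {"type": "object", "properties": {}}},
    {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object", "properties": {}}},
]

for tool in with_cache_marker(tools):
    print(tool["name"], "cache_control" in tool)
```

Because the output depends only on the set of tools, two agents that register the same tools in a different order still produce the same prefix.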
6. Extended Thinking: budget_tokens is Deprecated
Extended thinking is available on Opus 4.8 and Sonnet 4.6. The API changed on Opus 4.7+: budget_tokens is deprecated. The current parameter is effort, which takes "low", "medium", or "high" instead of a token count. Anthropic manages the allocation internally.
# OLD (deprecated on Opus 4.7+, will break)
thinking={
    "type": "enabled",
    "budget_tokens": 10000  # no longer accepted
}

# CURRENT (Opus 4.8, Sonnet 4.6)
response = client.messages.create(
    model="claude-opus-4-8-20260528",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "effort": "high"  # low | medium | high
    },
    messages=[{"role": "user", "content": complex_audit_prompt}]
)

for block in response.content:
    if block.type == "thinking":
        pass  # internal reasoning chain
    elif block.type == "text":
        final_answer = block.text
Important: toggling thinking on and off between turns invalidates prompt caching for the message history. Decide at the conversation level whether thinking is on, and keep it consistent throughout that conversation.
Thinking blocks get cached as part of the request content when you pass them back in tool use conversations. During tool use, return thinking blocks to the API unmodified along with your tool result.
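In practice that means the assistant turn goes back into the history exactly as it arrived, followed by a user turn carrying the tool results. The sketch below uses plain dicts in place of SDK block objects so it runs offline; the block contents, the signature value, and the helper name are all illustrative.

```python
def append_tool_turn(messages: list, assistant_content: list, tool_results: list) -> list:
    """Return a new history with the assistant turn (unmodified) and the tool results appended."""
    return messages + [
        # Pass every block back as received: thinking first, then tool_use.
        {"role": "assistant", "content": list(assistant_content)},
        {"role": "user", "content": tool_results},
    ]


history = [{"role": "user", "content": "Audit src/main.js"}]

# Stand-in for response.content from a turn that stopped to call a tool
assistant_content = [
    {"type": "thinking", "thinking": "I need the file contents first.", "signature": "sig-placeholder"},
    {"type": "tool_use", "id": "toolu_01", "name": "read_file", "input": {"path": "src/main.js"}},
]
tool_results = [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": "console.log('hi')"}
]

updated = append_tool_turn(history, assistant_content, tool_results)
print([message["role"] for message in updated])
```

The helper deliberately does no filtering or rewriting of blocks, since editing or dropping a thinking block is exactly what the rule above warns against.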
7. Tool Use That Does Not Break Agents
Three rules that prevent infinite tool loops in multi-agent systems:
- Always return a tool_result, even on error. If a tool fails, return a tool_result block with is_error: true. Claude adapts. Skip the result and Claude thinks it is still waiting and will loop.
- Set max_tokens with enough headroom and check stop_reason. A response that hits the limit mid-tool-call comes back with stop_reason "max_tokens" and a truncated tool_use block that breaks your parser downstream.
- Scope tools per agent role. If you expose 20 tools, the model reasons over all 20 every turn. Give the Researcher search tools. Give the Auditor code analysis tools. Not all 20 to everyone.
tool_result = {
    "type": "tool_result",
    "tool_use_id": tool_use_block.id,
    "content": "Error: file not found at path ./src/main.js",
    "is_error": True  # tells Claude this call failed
}
messages.append({"role": "user", "content": [tool_result]})
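The first rule is easiest to enforce in one place: a wrapper that always produces a tool_result, whatever the tool does. The following is a minimal sketch with hypothetical names; it takes the tool_use block as a dict and a per-agent mapping of tool names to Python callables, which is also a natural spot to scope tools by role.

```python
from typing import Callable


def execute_tool_call(tool_use: dict, handlers: dict[str, Callable[..., str]]) -> dict:
    """Run one tool call and always return a tool_result block, even on failure."""
    result = {"type": "tool_result", "tool_use_id": tool_use["id"]}
    handler = handlers.get(tool_use["name"])
    if handler is None:
        # Tool not exposed to this agent role: report it instead of going silent
        result.update(content=f"Error: unknown tool {tool_use['name']!r}", is_error=True)
        return result
    try:
        result["content"] = handler(**tool_use["input"])
    except Exception as exc:  # any tool failure becomes an error result
        result.update(content=f"Error: {exc}", is_error=True)
    return result


def read_file(path: str) -> str:
    raise FileNotFoundError(f"file not found at path {path}")


auditor_handlers = {"read_file": read_file}  # only the tools this role needs

call = {"id": "toolu_02", "name": "read_file", "input": {"path": "./src/main.js"}}
print(execute_tool_call(call, auditor_handlers))
```

Catching the broad `Exception` is intentional here: the goal is that no failure path leaves a tool_use block without a matching result.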
8. Context Distillation Between Agents
All current Claude models have a 1M token context window. That does not mean you should use it. In a 6-agent pipeline where each agent passes its full conversation to the next, you are burning tokens and slowing every call. The correct pattern: extract only the structured output at each step and pass that forward.
const researchBrief = await researcher.run(goal);
const architectPlan = await architect.run({
  goal,
  research: researchBrief.summary,     // ~500 tokens
  sources: researchBrief.keyFindings   // ~300 tokens
  // NOT: researchBrief.fullConversation // 40,000 tokens
});
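The same handoff can be made explicit on the Python side by giving each agent's output a type whose forwarding method simply cannot include the transcript. This is a sketch with invented names (`ResearchBrief`, `handoff`), mirroring the fields in the snippet above.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchBrief:
    summary: str
    key_findings: list[str]
    # Kept for debugging and audit logs, never forwarded to the next agent
    full_conversation: list[dict] = field(default_factory=list, repr=False)

    def handoff(self, max_findings: int = 5) -> dict:
        """Structured payload for the next agent: distilled output only."""
        return {
            "research": self.summary,
            "sources": self.key_findings[:max_findings],
        }


brief = ResearchBrief(
    summary="The service uses an outdated JWT library.",
    key_findings=["jwt-lib pinned to an old major version", "no token expiry check"],
    full_conversation=[{"role": "user", "content": "..."}] * 40,
)
print(brief.handoff())
```

Capping the number of findings is an extra guard in this sketch, so a verbose upstream agent cannot inflate the next agent's prompt.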
9. Model Tiering: Use Haiku Where You Do Not Need Opus
Use Haiku 4.5 for routing and classification (which agent handles this? does this input look valid?). Use Sonnet 4.6 for the main agentic work. Reserve Opus 4.8 for the steps that genuinely need deep reasoning: architecture planning, multi-pass security audit, complex code generation. Switching from all-Opus to tiered routing cuts per-run LLM costs by 60% in most pipelines.
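A routing table is usually enough to implement this tiering. The sketch below is illustrative: the task-type labels are invented, the Opus and Haiku IDs are the ones used elsewhere in this article, and the Sonnet ID is a placeholder that should be checked against the current model list.

```python
OPUS_MODEL = "claude-opus-4-8-20260528"    # ID used elsewhere in this article
HAIKU_MODEL = "claude-haiku-4-5-20251001"  # ID used elsewhere in this article
SONNET_MODEL = "claude-sonnet-4-6"         # placeholder: confirm the real ID before use

# Hypothetical task labels mapped to the cheapest tier that handles them well
MODEL_BY_TASK = {
    "routing": HAIKU_MODEL,
    "classification": HAIKU_MODEL,
    "input_validation": HAIKU_MODEL,
    "architecture_planning": OPUS_MODEL,
    "security_audit": OPUS_MODEL,
    "complex_codegen": OPUS_MODEL,
}


def model_for(task_type: str) -> str:
    """Return the model for a task type, defaulting to the mid tier for agentic work."""
    return MODEL_BY_TASK.get(task_type, SONNET_MODEL)


for task in ("routing", "tool_loop", "security_audit"):
    print(task, "->", model_for(task))
```

Defaulting unknown task types to the middle tier keeps new agents from silently landing on the most expensive model.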
10. Batch API for Non-Urgent Work
The Batch API processes requests asynchronously within 24 hours at a flat 50% discount on all input and output tokens. Documentation generation, data classification, evaluation runs, pre-computed analysis — anything that does not need a real-time response should go through the Batch API.
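import time

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": doc_prompt}],
            },
        }
        for i, doc_prompt in enumerate(doc_prompts)
    ]
)

# Results are only available once processing has ended, so poll first
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(60)

# Results can come back in any order; match them up by custom_id
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)
    else:
        print(entry.custom_id, "failed:", entry.result.type)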